A Closer Look at Scoring Functions and Generalization Prediction
Generalization error predictors (GEPs) aim to predict model performance on
unseen distributions by deriving dataset-level error estimates from
sample-level scores. However, GEPs often utilize disparate mechanisms (e.g.,
regressors, thresholding functions, calibration datasets, etc.) to derive such
error estimates, which can obfuscate the benefits of a particular scoring
function. Therefore, in this work, we rigorously study the effectiveness of
popular scoring functions (confidence, local manifold smoothness, model
agreement), independent of mechanism choice. We find, absent complex
mechanisms, that state-of-the-art confidence- and smoothness-based scores fail
to outperform simple model-agreement scores when estimating error under
distribution shifts and corruptions. Furthermore, in realistic settings where
the training data has been compromised (e.g., label noise, measurement noise,
undersampling), we find that model-agreement scores continue to perform well
and that ensemble diversity is important for improving their performance.
Finally, to better understand the limitations of scoring functions, we
demonstrate that simplicity bias, or the propensity of deep neural networks to
rely upon simple but brittle features, can adversely affect GEP performance.
Overall, our work carefully studies the effectiveness of popular scoring
functions in realistic settings and helps to better understand their
limitations.
Comment: Accepted to ICASSP 202
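To make the scoring-function distinction concrete, here is a minimal NumPy sketch of two kinds of sample-level scores discussed above (max-softmax confidence and model agreement) plus a simple thresholding mechanism that turns scores into a dataset-level error estimate. The function names and the 0.5 threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def confidence_score(probs):
    """Max-softmax confidence per sample; probs has shape (n_samples, n_classes)."""
    return probs.max(axis=1)

def agreement_score(ensemble_probs):
    """Model-agreement score: fraction of ensemble members that agree with the
    majority-vote prediction for each sample.
    ensemble_probs has shape (n_models, n_samples, n_classes)."""
    preds = ensemble_probs.argmax(axis=2)          # (n_models, n_samples)
    n_models, n_samples = preds.shape
    agree = np.empty(n_samples)
    for i in range(n_samples):
        counts = np.bincount(preds[:, i])
        agree[i] = counts.max() / n_models
    return agree

def dataset_error_estimate(scores, threshold=0.5):
    """Simple thresholding mechanism (hypothetical choice): the estimated error
    rate is the fraction of samples whose score falls below the threshold."""
    return float(np.mean(scores < threshold))
```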
PAGER: A Framework for Failure Analysis of Deep Regression Models
Safe deployment of AI models requires proactive detection of potential
prediction failures to prevent costly errors. While failure detection in
classification problems has received significant attention, characterizing
failure modes in regression tasks is more complicated and less explored.
Existing approaches rely on epistemic uncertainties or feature inconsistency
with the training distribution to characterize model risk. However, we show
that uncertainties are necessary but insufficient to accurately characterize
failure, owing to the various sources of error. In this paper, we propose PAGER
(Principled Analysis of Generalization Errors in Regressors), a framework to
systematically detect and characterize failures in deep regression models.
Built upon the recently proposed idea of anchoring in deep models, PAGER
unifies both epistemic uncertainties and novel, complementary non-conformity
scores to organize samples into different risk regimes, thereby providing a
comprehensive analysis of model errors. Additionally, we introduce novel
metrics for evaluating failure detectors in regression tasks. We demonstrate
the effectiveness of PAGER on synthetic and real-world benchmarks. Our results
highlight the capability of PAGER to identify regions of accurate
generalization and detect failure cases in out-of-distribution and
out-of-support scenarios.
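As a rough illustration of the anchoring idea PAGER builds on, the sketch below treats any regressor that accepts a concatenated [anchor, input - anchor] representation as a black box, uses the spread of its predictions over anchors as an epistemic-uncertainty estimate, and buckets samples by combining that estimate with a non-conformity score. All names, thresholds, and regime labels here are hypothetical; the actual framework's scores and risk regimes differ in detail.

```python
import numpy as np

def anchored_predict(model, x, anchors):
    """Anchoring-style inference: the model consumes a reference anchor together
    with the residual (x - anchor). Sweeping anchors at test time yields a
    distribution of predictions per input.
    `model` is any callable mapping an array of shape (n, 2*d) to shape (n,)."""
    preds = []
    for a in anchors:                                   # a has shape (d,)
        residual = x - a                                # broadcast over samples
        inp = np.concatenate([np.tile(a, (x.shape[0], 1)), residual], axis=1)
        preds.append(model(inp))
    preds = np.stack(preds)                             # (n_anchors, n_samples)
    return preds.mean(axis=0), preds.std(axis=0)        # prediction, epistemic UQ

def risk_regimes(epistemic, nonconformity, u_thresh, s_thresh):
    """Toy bucketing: combine uncertainty with a non-conformity score to assign
    each sample to a coarse risk regime (thresholds are hypothetical)."""
    regimes = np.where((epistemic < u_thresh) & (nonconformity < s_thresh),
                       "low risk", "moderate risk")
    regimes = np.where((epistemic >= u_thresh) & (nonconformity >= s_thresh),
                       "high risk", regimes)
    return regimes
```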
Accurate and Scalable Estimation of Epistemic Uncertainty for Graph Neural Networks
Safe deployment of graph neural networks (GNNs) under distribution shift
requires models to provide accurate confidence indicators (CI). However, while
it is well-known in computer vision that CI quality diminishes under
distribution shift, this behavior remains understudied for GNNs. Hence, we
begin with a case study on CI calibration under controlled structural and
feature distribution shifts and demonstrate that increased expressivity or
model size does not always lead to improved CI performance. Consequently, we
instead advocate for the use of epistemic uncertainty quantification (UQ)
methods to modulate CIs. To this end, we propose G-ΔUQ, a new single-model
UQ method that extends the recently proposed stochastic centering
framework to support structured data and partial stochasticity. Evaluated
across covariate, concept, and graph size shifts, G-ΔUQ not only
outperforms several popular UQ methods in obtaining calibrated CIs, but also
outperforms alternatives when CIs are used for generalization gap prediction or
OOD detection. Overall, our work not only introduces a new, flexible GNN UQ
method, but also provides novel insights into GNN CIs on safety-critical tasks.
Comment: 22 pages, 11 figures
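The stochastic-centering idea referenced here can be sketched loosely in NumPy, under the assumption that a randomly drawn anchor is subtracted from every node feature and re-appended, so that a (hypothetical) GNN forward function can be run under several anchors and the spread of its outputs used as epistemic uncertainty. This is an illustration of the general mechanism, not the paper's architecture or its partial-stochasticity variants.

```python
import numpy as np

def stochastically_centered_features(node_feats, anchor_bank, rng):
    """Stochastic centering on node features: subtract a randomly drawn anchor
    from every node feature and concatenate the anchor so the network can undo it.
    node_feats: (n_nodes, d); anchor_bank: (n_anchors_available, d)."""
    anchor = anchor_bank[rng.integers(len(anchor_bank))]        # (d,)
    centered = node_feats - anchor                              # (n_nodes, d)
    anchor_tiled = np.tile(anchor, (node_feats.shape[0], 1))    # (n_nodes, d)
    return np.concatenate([anchor_tiled, centered], axis=1)     # (n_nodes, 2d)

def uq_inference(gnn_forward, node_feats, anchor_bank, n_anchors=10, seed=0):
    """Run the (hypothetical) GNN forward pass under several anchors and use the
    spread of class probabilities as an epistemic-uncertainty estimate."""
    rng = np.random.default_rng(seed)
    probs = np.stack([
        gnn_forward(stochastically_centered_features(node_feats, anchor_bank, rng))
        for _ in range(n_anchors)
    ])                                                          # (n_anchors, n_nodes, n_classes)
    return probs.mean(axis=0), probs.std(axis=0)
```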
Analyzing Data-Centric Properties for Graph Contrastive Learning
Recent analyses of self-supervised learning (SSL) find the following
data-centric properties to be critical for learning good representations:
invariance to task-irrelevant semantics, separability of classes in some latent
space, and recoverability of labels from augmented samples. However, given
their discrete, non-Euclidean nature, graph datasets and graph SSL methods are
unlikely to satisfy these properties. This raises the question: how do graph
SSL methods, such as contrastive learning (CL), work well? To systematically
probe this question, we perform a generalization analysis for CL when using
generic graph augmentations (GGAs), with a focus on data-centric properties.
Our analysis yields formal insights into the limitations of GGAs and the
necessity of task-relevant augmentations. As we empirically show, GGAs do not
induce task-relevant invariances on common benchmark datasets, leading to only
marginal gains over naive, untrained baselines. Our theory motivates a
synthetic data generation process that enables control over task-relevant
information and boasts pre-defined optimal augmentations. This flexible
benchmark helps us identify yet unrecognized limitations in advanced
augmentation techniques (e.g., automated methods). Overall, our work rigorously
contextualizes, both empirically and theoretically, the effects of data-centric
properties on augmentation strategies and learning paradigms for graph SSL.
Comment: Accepted to NeurIPS 202
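To ground the term "generic graph augmentation," here is a small NumPy sketch of one common GGA (random edge dropping) together with an InfoNCE-style contrastive objective over embeddings of two augmented views. This is a generic illustration of graph contrastive learning, not the specific augmentations or losses analyzed in the paper.

```python
import numpy as np

def drop_edges(edge_index, drop_prob, rng):
    """Generic graph augmentation example: randomly drop a fraction of edges.
    edge_index has shape (2, n_edges)."""
    keep = rng.random(edge_index.shape[1]) >= drop_prob
    return edge_index[:, keep]

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE-style contrastive loss between two augmented views' embeddings;
    rows of z1 and z2 correspond to the same underlying graphs (positives)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature                        # (n, n) similarity matrix
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # positives on the diagonal
```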
Fairness-Aware Graph Neural Networks: A Survey
Graph Neural Networks (GNNs) have become increasingly important due to their
representational power and state-of-the-art predictive performance on many
fundamental learning tasks. Despite this success, GNNs suffer from fairness
issues that arise as a result of the underlying graph data and the fundamental
aggregation mechanism that lies at the heart of the large class of GNN models.
In this article, we examine and categorize fairness techniques for improving
the fairness of GNNs. Prior fair GNN models and techniques are
discussed in terms of whether they focus on improving fairness during a
preprocessing step, during training, or in a post-processing phase.
Furthermore, we discuss how such techniques can be used together whenever
appropriate, and highlight their advantages and the intuition behind them. We also
introduce an intuitive taxonomy for fairness evaluation metrics including
graph-level fairness, neighborhood-level fairness, embedding-level fairness,
and prediction-level fairness metrics. In addition, graph datasets that are
useful for benchmarking the fairness of GNN models are summarized succinctly.
Finally, we highlight key open problems and challenges that remain to be
addressed.
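As an example of a prediction-level fairness metric of the kind this taxonomy covers, the sketch below computes the statistical parity difference over node predictions for a binary sensitive attribute. The metric choice and the toy data are illustrative, not drawn from the survey.

```python
import numpy as np

def statistical_parity_difference(y_pred, sensitive):
    """Prediction-level fairness metric: absolute difference in positive-prediction
    rates between the two groups defined by a binary sensitive attribute."""
    y_pred = np.asarray(y_pred)
    sensitive = np.asarray(sensitive)
    rate_a = y_pred[sensitive == 0].mean()
    rate_b = y_pred[sensitive == 1].mean()
    return float(abs(rate_a - rate_b))

# Toy example: predictions for 6 nodes split into two groups by the attribute.
preds = np.array([1, 0, 1, 1, 0, 0])
group = np.array([0, 0, 0, 1, 1, 1])
print(statistical_parity_difference(preds, group))  # |2/3 - 1/3| ≈ 0.333
```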